Abstract

This paper analyzes the suitability of reinforcement learning for both 
programming and adapting situated agents. In the first part of
the paper we discuss two specific reinforcement learning algorithms: 
Q-learning and the Bucket Brigade. We introduce a special case of 
the Bucket Brigade, and analyze and compare its performance to Q- 
learning in a number of experiments. The second part of the paper 
discusses the key problems of reinforcement learning: time and space 
complexity, input generalization, sensitivity to parameter values, and 
selection of the reinforcement function. We address the tradeoff between
the amount of built-in and learned knowledge in the context
of the number of training examples required by a learning algorithm. 
Finally, we suggest directions for future research. 
1 Introduction

Reinforcement learning (RL) has recently grown in pop- 
ularity as the learning methodology of choice in the situ- 
ated agent community. RL is appealing because it allows 
the agent to adapt to its environment as it gains infor- 
mation over time. It is particularly well suited for action 
learning, which is the main concern in control of situated 
agents. 

However, reinforcement learning suffers from a num- 
ber of problems which are in conflict with the goals of 
situated agent control. This paper analyzes the suitability
of the general approach through an in-depth comparison
of two RL methods: Q-learning, introduced by
Watkins [Watkins 89] and the classifier system (CS) 
Bucket Brigade, introduced by Holland [Holland 85]. 
The first part of the paper introduces and compares the 
two methods. The second part of the paper discusses the
properties of reinforcement learning, as demonstrated by 
the example algorithms, their weaknesses, and their role 
within the space of various learning approaches. 

2 A Definition of Reinforcement 
Learning 

Reinforcement learning (RL) addresses the problem of 
learning a mapping (also called a policy or action map) 
between all of the states the system can be in and all of 
the actions it can execute using the reward and punish- 
ment received from the world. The goal is for the agent 
to learn to select the right action in each state. The in- 
puts to the learner are the state of the world and the 
reinforcement signal, and its outputs are actions. Per- 
fect state information is assumed. The problem is to 
find a function which closely enough approximates the 
mapping between all of the states the agent can perceive, 
and all of the actions it can take. The method is some 
form of search of the space of possible functions. RL 
algorithms have been used to learn tasks ranging from 
pushing a box without getting stuck to balanced walking 
and navigating through a maze. 

The following is the general form of an RL algorithm 
[Kaelbling 90]: 

1. Initialize the learner's internal state I to $I_0$.

2. Do Forever: 

a. Observe the current world state s. 

b. Choose an action a = F(I,s) 
using the evaluation function F. 

c. Execute action a. 

d. Let r be the immediate reward for 
executing a in world state s. 

e. Update the internal state I = U(I, s, a, r) 
using the update function U. 



The internal state, I, encodes the information the learning
algorithm saves about the world, usually in the form
of a table maintaining state and action data. The up- 
date function U adjusts the current state based on the 
received reinforcement, and maps the current internal 
state, input, action, and reinforcement into a new internal
state. The evaluation function F maps an internal
state and an input into an action based on the information
stored in the internal state. The different RL
algorithms vary in their definitions of U and F. 
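
The general form above can be sketched as a loop parameterized by F and U. This is only an illustrative sketch: the names `run_learner` and `world_step`, the finite `steps` bound standing in for "Do Forever", and the two-state toy world are assumptions made for the example and are not taken from [Kaelbling 90].

```python
from typing import Any, Callable, Tuple


def run_learner(I0: Any,
                F: Callable[[Any, int], int],
                U: Callable[[Any, int, int, float], Any],
                world_step: Callable[[int, int], Tuple[int, float]],
                s0: int,
                steps: int) -> Any:
    """Run the generic RL loop for a bounded number of steps (hypothetical helper)."""
    I, s = I0, s0                     # 1. initialize the internal state
    for _ in range(steps):            # 2. "Do Forever", truncated here
        a = F(I, s)                   # b. choose an action a = F(I, s)
        s_next, r = world_step(s, a)  # c, d. execute a and receive reward r
        I = U(I, s, a, r)             # e. update I = U(I, s, a, r)
        s = s_next                    # a. observe the new world state
    return I


# Toy world (assumption): reward 1 when the action index equals the state.
def toy_world(s: int, a: int) -> Tuple[int, float]:
    return (s + 1) % 2, (1.0 if a == s else 0.0)


# Toy F: pick the action with the largest accumulated reward in the table.
def toy_F(I: dict, s: int) -> int:
    return max((0, 1), key=lambda a: I.get((s, a), 0.0))


# Toy U: the internal state is a table of summed reward per (state, action).
def toy_U(I: dict, s: int, a: int, r: float) -> dict:
    J = dict(I)
    J[(s, a)] = J.get((s, a), 0.0) + r
    return J


final_I = run_learner({}, toy_F, toy_U, toy_world, s0=0, steps=4)
```

Keeping F and U as arguments mirrors the statement that RL algorithms differ only in how these two functions are defined.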

The above framework assumes that, at each time step, 
the agent receives immediate reinforcement, the com- 
plete information about the value of the last action it 
took. In the general case, reinforcement can be arbi- 
trarily delayed, and the problem of assigning reward or 
punishment to a state based on delayed reinforcement is 
termed temporal credit assignment. The first statement
of the problem is due to [Samuel 59], whose checkers-learning
program dealt with deciding which moves to
reward for eventually leading to "a triple jump." 

Temporal credit can be assigned in two ways: either 
the reward is appropriated to all of the state-action pairs 
after it is received, or an expected value of the future re- 
ward is calculated and maintained incrementally. The 
latter approach leads to a class of delayed reinforce- 
ment algorithms termed temporal difference (TD) meth- 
ods which assign credit locally based on the difference 
between temporally successive predictions [Sutton 88]. 
Both Q-learning and the Bucket Brigade are instances
of TD.

While temporal credit assignment deals with propa- 
gating reward backward in time, structural credit as- 
signment deals with propagating the reward across sim- 
ilar states in order to couple them with similar actions. 
All RL approaches rely on exploring the complete state 
space, which is exponential in the size of the input vec- 
tor. Consequently, input generalization, the ability to 
collapse similar input states together, is critical in mak- 
ing the approach computationally feasible. 

Specific methods for dealing with both temporal and 
structural credit assignment will be described and ana- 
lyzed in subsequent sections. 

3 Q-learning 

Q-learning is a reinforcement learning algorithm based 
on delayed reinforcement [Watkins 89]. The goal of 
the algorithm is to, at each time step, maximize Q(s, a), 
the expected discounted reward of taking action a in the 
input state s. The algorithm maintains and updates a 
table of Q values, one for each state-action combination. 
The utility E of any state is the maximum Q value of 
all actions that can be taken in that state. The Q value 
of doing an action in a state is defined as the sum of 
the immediate reward r and the utility E(s') of the next 
state s' according to the state transition function T, discounted
by the parameter $\gamma$.



Formally: 



$$s' \leftarrow T(s, a)$$

$$E(s) = \max_{a} Q(s, a)$$

$$Q(s, a) = r + \gamma E(s'), \quad 0 < \gamma < 1$$



Q values are updated by the following rule: 

$$Q(s, a) \leftarrow Q(s, a) + \beta \left( r + \gamma E(s') - Q(s, a) \right), \quad 0 < \beta < 1$$



An RL algorithm using Q-learning has the following 
form: 

1. Initialize all Q(s,a); select $s_0$.

2. Do Forever: 

a. Observe the current world state s. 

b. Choose an action a that maximizes Q(s, a). 

c. Execute action a. 

d. Let r be the immediate reward for 
executing a in state s. 

e. Update Q(s, a) according to the rule above. 
Let the new state be $s' \leftarrow T(s, a)$.
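
A minimal tabular sketch of the update rule and greedy action choice follows. The function names, the dictionary-based table, and the example numbers are assumptions chosen for illustration; only the update formula itself comes from the text.

```python
from collections import defaultdict


def utility(Q, s, actions):
    """E(s): the maximum Q value over all actions available in state s."""
    return max(Q[(s, a)] for a in actions)


def q_update(Q, s, a, r, s_next, actions, beta, gamma):
    """Apply Q(s,a) <- Q(s,a) + beta * (r + gamma * E(s') - Q(s,a))."""
    target = r + gamma * utility(Q, s_next, actions)
    Q[(s, a)] += beta * (target - Q[(s, a)])
    return Q[(s, a)]


def greedy_action(Q, s, actions):
    """Step b of the algorithm: choose an action that maximizes Q(s, a)."""
    return max(actions, key=lambda a: Q[(s, a)])


# Illustrative values only (not the experimental settings of the paper).
actions = ["forward", "left", "right"]
Q = defaultdict(float)  # every Q(s, a) starts at 0
first = q_update(Q, s=0, a="forward", r=5.0, s_next=1,
                 actions=actions, beta=0.5, gamma=0.9)
second = q_update(Q, s=0, a="forward", r=5.0, s_next=1,
                  actions=actions, beta=0.5, gamma=0.9)
```

With a learning rate of 0.5 the entry moves halfway toward the target on each visit, which is the history-preserving behavior the learning rate is meant to provide.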

The key drawbacks of Q-learning are its sensitivity 
to the parameter values and the reinforcement function, 
and its time and space complexity mandated by the state 
space and the Q table which must be maintained. 

The choice of $\beta$ and $\gamma$, the key parameters in Q-learning,
affects the efficiency of the learner. $\beta$ determines
the learning rate; $\beta = 1$ results in an update rule
which disregards all history accumulated in the current 
Q value. It resets Q to the current sum of the received 
and expected reward at every time step, which usually 
causes the algorithm to oscillate. 
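
Substituting a learning rate of 1 into the update rule makes this explicit: the old Q value cancels, so the new value depends only on the latest reward and the latest utility estimate.

```latex
Q(s,a) \leftarrow Q(s,a) + 1 \cdot \bigl( r + \gamma E(s') - Q(s,a) \bigr) = r + \gamma E(s')
```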

$\gamma$ is the discount factor for future reward. Ideally, $\gamma$
should be as close to 1 as possible so that the relevance
of future reward is maximized. In a deterministic world,
$\gamma$ can be set to 1, but in the general case two algorithms
with $\gamma = 1$ cannot be compared since, in the limit, the
expected future reinforcement of both will go to $\infty$.
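
As a hedged illustration, assume (purely for the sake of the example) that the agent receives a constant reward r > 0 at every time step. The discounted sum is then a geometric series, which is finite for any discount factor below 1 and diverges when the discount factor equals 1:

```latex
\sum_{k=0}^{\infty} \gamma^{k} r = \frac{r}{1-\gamma} \quad (0 \le \gamma < 1),
\qquad
\sum_{k=0}^{\infty} 1^{k}\, r = \infty \quad (\gamma = 1)
```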

The initial Q values can affect the speed of conver- 
gence. Intuitively, if the table is initialized close to the 
optimal policy, this will speed up the learning process. 
Of course, the optimal Q values are not known a priori.
If initialized to 0 in a problem whose optimal policy
has positive final Q values, the algorithm will converge 
to the first positive value, never exploring other possi- 
bilities [Kaelbling 90]. This can be remedied by occa- 
sionally performing a random action to guarantee that 
the entire action space is eventually explored. A better 
solution is to initialize the Q values to be higher than 
their anticipated optimal values and gradually decrease 
them. 

Q-learning is sensitive to the coupling between the ini- 
tial Q values and the reinforcement function. If the func- 
tion is positive and the table is initialized to values ex- 
ceeding the optimal policy, the system will take longer 
to converge than if the reinforcement function contains 
some negative signals. 

Finally, the convergence of Q-learning requires a large 
number of trials, i.e. the algorithm relies on an infi- 
nite number of visits to the same state (for the proof 
see [Watkins 89]). This is a key drawback of classical 
Q-learning: it takes too long to converge for any non- 
trivially sized input vector. 

4 Reinforcement Learning in Classifier 
Systems 

We now turn to another instance of RL which, on the sur- 
face, appears rather different from Q-learning but shares 
some critical similarities. A classifier system (CS) is
an adaptive production rule system consisting of a fixed
number of condition-action pairs called classifiers [Hol- 
land 86]. The conditions are encoded as fixed-length 
bit strings over the alphabet {0,1, #}, where # is the 
default or "don't care" symbol. The action of a classifier 
consists of posting its message to a global board, which 
may result in an action to be performed in the world. At 
each time step, the messages on the board are matched 
to the conditions of all classifiers in parallel, and all sat- 
isfied classifiers make bids to post their messages to the 
board next. The highest bidders win and their messages 
are posted. 

Classifier systems perform two types of learning: what 
classifiers to have (classifier generation) and what clas- 
sifiers to activate (classifier reinforcement). Genetic al- 
gorithms are a class of methods for classifier generation. 
They employ mutation and crossover on the classifier 
population in order to, over time, evolve increasingly 
more "fit" classifiers. Widely discussed in the literature 
(e.g. [Goldberg 89], [Goldberg 85]), genetic algorithms
will not be addressed here. Instead, we will concen- 
trate on classifier reinforcement, the process of assigning 
strengths to classifiers based on the reward they receive 
over time. 

4.1 The Bucket Brigade Algorithm 

The Bucket Brigade is a temporal differencing reinforce- 
ment learning algorithm for propagating reward down a 
chain of classifiers. Whenever reward is received, it is di- 
vided among the classifiers whose firing enabled it. Since 
reward is not received at every time step, the strength of 
a classifier is adjusted based on its "distance" from the 
reward. The closer the classifier in the chain to the re- 
ward, the more strength it receives. The classifiers whose 
firing was immediately followed by reinforcement divide 
the reward. Next, the classifiers that enabled them re- 
ceive a smaller share of the reward, and so on down the 
chain. 

Initially, all classifiers are assigned equal strength S. 
When a classifier C matches a message on the board, it 
posts a bid B proportional to its strength and its speci- 
ficity. The specificity H of a classifier is the ratio be- 
tween the number of specified (non-#) bits and the to- 
tal number of bits in the classifier's condition. If C wins 
the bidding, it gets to post a message to the board next, 
and its strength is decreased by the magnitude of its bid. 
If its message causes an external action, its strength is 
increased by a portion of the received reinforcement r. 

In addition to immediate reinforcement, a classifier's
strength is increased by the bids of its successors. If
a classifier C posts a message which is, in the next time
step, matched by another classifier C', and C' then wins
a bid, the strength of C is increased by the amount of
the bid of C'. If multiple classifiers contributed to the match of C',
they split the bid evenly.

Formally: 

M(C) = number of messages matched by C

n = condition length in bits

$$H(C) = \frac{\text{number of non-\#'s}}{n}$$

$$B(C, t) = c \, H(C) \, S(C, t), \quad 0 < c < 1$$



Classifier strength is updated by the following rule: 
$$S(C, t+1) \leftarrow S(C, t) + r - B(C, t) + \frac{B(C', t+1)}{M(C')}$$

An RL algorithm using the Bucket Brigade has the fol- 
lowing form: 

1. Initialize all S(C); select some C and post its 
message m on the board. 

2. Do Forever: 

For each message m on the board: 

a. Match m to all classifiers. 

b. Compute B(C) for all C that match m; 
select the winners. 

c. Post the winners' messages $m_{new}$ on the board.

d. Let r be the immediate reward for posting $m_{new}$.

e. Update S(C) according to the rule above. 
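
The bid and strength-update rules can be sketched as three small functions. The function names and the example numbers are assumptions for illustration; conditions are represented as strings over the alphabet {0, 1, #} as described above.

```python
def specificity(condition: str) -> float:
    """H(C): fraction of specified (non-#) positions in the condition."""
    return sum(ch != "#" for ch in condition) / len(condition)


def bid(condition: str, strength: float, c: float) -> float:
    """B(C, t) = c * H(C) * S(C, t)."""
    return c * specificity(condition) * strength


def updated_strength(strength: float, r: float, own_bid: float,
                     successor_bid: float, successor_matches: int) -> float:
    """S(C, t+1) = S(C, t) + r - B(C, t) + B(C', t+1) / M(C')."""
    return strength + r - own_bid + successor_bid / successor_matches


# Illustrative values (assumptions): C has condition "1#0#" and strength 10;
# its successor C' has condition "11", strength 20, and M(C') = 2.
b_c = bid("1#0#", strength=10.0, c=0.1)
b_succ = bid("11", strength=20.0, c=0.1)
s_next = updated_strength(10.0, r=0.0, own_bid=b_c,
                          successor_bid=b_succ, successor_matches=2)
```

In this example C pays its own bid and recovers half of the bid of C', so its strength grows even though no external reward arrived at this step.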

4.2 Long Classifier Chains 

As is typical for production rule systems, the number of rule
firings (or classifier activations) required to connect an
input and an output of a classifier system can be un- 
bounded [Kaelbling 90]. The longer the classifier se- 
quence, the longer it takes to update the strengths of the 
early classifiers, i.e. the more times the system must go 
through the same classifier sequence. This makes Bucket 
Brigade systems slow to adapt to changes in the environ- 
ment. 

In order to speed up learning, [Holland 85] suggests 
the use of a "bridging" or "epoch marking" classifier that 
is activated by the first classifier in a sequence, and re- 
mains active until the end of the sequence when exter- 
nal reinforcement is received. At this time, the epoch 
marker receives a large amount of reinforcement. When 
the chain is activated again, the epoch marker passes 
some of its strength directly to the first classifier in the 
sequence. Since its strength is high, the fraction it passes 
on to the front of the chain significantly upgrades the 
strength of the first classifier. This shortens reward
propagation from the length of the chain to a single step.

Classifier sequences can be divided into two main 
types: reflex and non-reflex. Reflex sequences are simple 
chains in which each of the classifiers is activated solely 
by the message of its predecessor. Non-reflex sequences 
contain classifiers which are activated by more than one 
predecessor, i.e. more than one condition matches the 
posted message. [Riolo 87] shows that strengths of non- 
reflex classifiers fall off exponentially with the length of 
the chain. He gives experimental evidence that the use 
of bridging classifiers greatly expedites strength learning 
in both reflex and non-reflex sequences. 

4.3 Default Hierarchies 

The classifier system solution to decreasing the size of 
the state space is by categorizing states into abstrac- 
tion hierarchies. The categorization emerges from the 
presence of #, the "don't care" symbol in the classifier 
condition alphabet which allows for different levels of 
rule specificity in the system. The more #'s a classifier 
contains, the more general it is (e.g. (1 #) is more 
general than (1 1)), and the more conditions it will 
match. The rules which match the same conditions and
are more specific are termed "exceptions" and belong to
a subset of conditions matched by a more general par- 
ent. Over time, as more rules are added to the system, 
a hierarchy emerges in which increasingly more specific 
rules serve as exceptions to the more general classifiers. 
Through the bidding system, the more general rules are 
less likely to be correct, causing the formation of more 
specific rules. The system of default hierarchies allows a 
CS to learn incrementally. 
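
How the "don't care" symbol produces general rules and exceptions can be sketched as follows. The helper names `matches` and `is_exception_of` are hypothetical, and conditions and messages are assumed to be equal-length strings.

```python
def matches(condition: str, message: str) -> bool:
    """A condition matches a message if every position is '#' or equal."""
    return len(condition) == len(message) and all(
        c == "#" or c == m for c, m in zip(condition, message))


def is_exception_of(specific: str, general: str) -> bool:
    """True if `specific` matches a strict subset of what `general` matches.

    Assumes both conditions have the same length.
    """
    return specific != general and all(
        g == "#" or g == s for g, s in zip(general, specific))


# The condition (1 #) from the text, written here as the string "1#".
general_rule = "1#"
exception_rule = "11"
covered = [m for m in ("00", "01", "10", "11") if matches(general_rule, m)]
```

Here `covered` contains both messages that begin with 1, while the exception rule covers only one of them, which is the parent and exception relationship a default hierarchy is built from.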

An alternative to emergent hierarchies is to categorize 
the states by hand. [Wilson 87] suggests such an approach.
Although no general method for constructing a
hierarchy is given, the more domain specific knowledge 
is employed the more useful the hierarchy structure can 
be made. 

5 Q-learning vs. the Bucket Brigade 

Reinforcement learning is a form of gradient descent (or 
hill climbing) in parameter space. Specifically, the goal 
of RL algorithms is to minimize the parameter error, 
i.e. to maximize received reinforcement over time. Both 
Q-learning and the Bucket Brigade are gradient descent 
strategies. They both perform temporal and structural 
credit assignment by keeping track of combinations of 
states and actions. Both perform search in the space 
of "strength" functions mapping states to actions. In 
order to compare their performance, we next introduce 
a special case of the standard Bucket Brigade algorithm. 

Classifier systems couple two types of learning: rein- 
forcement learning and genetic learning. The reinforce- 
ment learning (Bucket Brigade) orders the classifiers by 
strength S. The genetic learning discards the weakest 
of the classifiers, and generates new ones by applying 
mutation and crossover operations on the strongest. In 
order to compare Bucket Brigade to Q-learning, the re- 
inforcement portion of the CS needs to be isolated, so 
the genetic portion is removed. However, the purpose 
of the reinforcement part of CS is to provide strengths 
for the genetic learner. Since classifiers now cannot be 
added and removed from the system, they must all be 
supplied initially. The system is initialized with a set of 
classifiers C(s, a) such that s is the condition or the n-bit
input state, and a is the action. Consequently, there
is a total of $2^n |a|$ classifiers, all of which are fully specified.
Thus, #'s are eliminated, and P, the measure of
specificity of a classifier (written H in Section 4.1), is the same for all C's, so the
P term is dropped from the bid equation.
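
The size of this initial classifier set can be checked with a short enumeration. The function name and the action labels are assumptions for illustration; with a 4-bit state and three actions the count is 48, which corresponds to a 16-state, three-action task.

```python
from itertools import product


def all_classifiers(n: int, actions: list) -> list:
    """Enumerate every fully specified classifier (s, a) for an n-bit state."""
    conditions = ["".join(bits) for bits in product("01", repeat=n)]
    return [(s, a) for s in conditions for a in actions]


classifiers = all_classifiers(4, ["forward", "left", "right"])
count = len(classifiers)  # 2**4 * 3
```

Because the count doubles with every additional input bit, supplying all classifiers up front is only practical for very small state vectors.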

Q-learning and the Bucket Brigade both deal with the 
problem of propagating reward down a chain of states. 
The key difference is that Q-learning uses the maximum 
discounted future reward, whereas the Bucket Brigade 
computes the current reward, and then propagates it 
to the previous state. The following formalism allows 
for implementing Q's maximization within the Bucket 
Brigade: 

C = (s, a), where s is a state and a is an action.

$$|C| = 2^n |a| \quad \text{and} \quad \forall C \, [P(C) = 1]$$

$$S(s, a, t+1) = S(s, a, t) + r - B(s, a, t)$$

$$S(C) = S(C) + B(C'), \quad \text{where } C' \leftarrow T(C)$$

$$B(s, a, t) = \begin{cases} c \, S(C, t) & \text{if } C(t) = \max_{a} S(s, a, t) \\ 0 & \text{otherwise} \end{cases}$$

Figure 1: The learning task consists of 16 states, one of
which is the goal. In each state three actions are possible:
move forward, turn left, and turn right. Attempting an
action against a boundary does not change the state.
The shown state transition function implements simple
motion so the task can be visualized as two-dimensional
navigation.



At each time step, the current classifier C'(s, a) receives
the immediate reinforcement, and pays the bid
proportional to the strength of the best action to be
taken from that state, i.e. the maximum strength classifier
that matches the current state s. This is the key
change: instead of paying a bid proportional to its own
strength S(s, a, t), C' pays proportional to the strongest
of the classifiers, $\max_{a} S(s, a, t)$, that match the current
state. In the same time step, the predecessor C, whose
message was matched by the current classifier C', receives
the value of the bid of C'.

This special case of the Bucket Brigade (SBB) imple- 
ments Q-learning in two steps. While Q-learning up- 
dates the Q value of the current state based on the max- 
imum of the next state, SBB updates the previous state 
with the maximum of the current state. 

SBB implements reflex sequences. Instead of bidding, 
the next action is selected so as to maximize the received 
reward (i.e. the one which uses a classifier with the most 
strength). Only one classifier is active at a time, unlike 
standard CS in which multiple classifiers can compete in
parallel. Furthermore, the current state receives the re- 
inforcement and passes it back to the previous state, so 
the bid is not shared but goes straight to the predecessor 
in the action chain. 
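
One SBB time step, as described in the prose above, might be sketched as follows. This is an interpretation rather than the paper's implementation: the function name `sbb_step`, the dictionary-based strength table, and the example values are all assumptions.

```python
def sbb_step(S, prev, cur, r, actions, c):
    """One SBB step (interpretive sketch).

    S       -- strength table mapping (state, action) to a float
    prev    -- predecessor classifier (state, action), or None at the start
    cur     -- current classifier (state, action)
    r       -- immediate reinforcement received by the current classifier
    actions -- all actions available in a state
    c       -- bid scaling parameter
    """
    s, _ = cur
    # The bid is proportional to the strongest classifier matching state s,
    # not to the current classifier's own strength.
    paid = c * max(S[(s, a)] for a in actions)
    S[cur] += r - paid       # current classifier receives r and pays the bid
    if prev is not None:
        S[prev] += paid      # the whole bid goes straight to the predecessor
    return paid


# Illustrative values only: two states, three actions, equal initial strength.
actions = ["forward", "left", "right"]
S = {(s, a): 10.0 for s in (0, 1) for a in actions}
paid = sbb_step(S, prev=(0, "forward"), cur=(1, "forward"),
                r=5.0, actions=actions, c=0.1)
```

The predecessor's strength is updated with the maximum over the current state, which is the sense in which SBB performs the Q-learning update one step later.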

5.1 An Example 

The following learning problem was used for comparing 
Q-learning and SBB. The world consists of 16 states, one 
of which is the goal, and three possible actions (going 
forward, turning left by 90 degrees, and turning right
by 90 degrees), all of which can be tried in all states.
The state transition function is defined so that the prob- 
lem can be visualized as two-dimensional navigation in a 
row of four tiles, each of which contains four perceptual 
states. Attempting an action against a boundary does 
not change the state. Figure 1 illustrates the task and 
the state transition function. 

Action Selection: In both algorithms, actions were 
selected so as to maximize the Q or S value. In the goal 
state, a random action was selected in order to force the 
learner to escape the potential well which would keep it 
stuck at the goal where both Q (or S) and the received 
reinforcement are maximized. In order to converge to the 
optimal policy, the agent must explore the entire state 
space, rather than stay at the goal once it reaches it. It 
is not enough to select a random action with some small 
probability r. Unless r is relatively large, the accumu- 
lated probability of the agent escaping the potential well 
is too small to allow for learning the policy in a reason- 
able number of trials. 
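
The selection rule described above, greedy everywhere except for a forced random action at the goal, can be sketched as below. The function name, the table layout, and the choice of state 15 as the goal are assumptions for illustration.

```python
import random


def select_action(table, s, actions, goal, rng):
    """Pick a random action at the goal; otherwise maximize the table value."""
    if s == goal:
        return rng.choice(actions)   # forces the agent out of the goal state
    return max(actions, key=lambda a: table[(s, a)])


actions = ["forward", "left", "right"]
table = {(0, "forward"): 1.0, (0, "left"): 3.0, (0, "right"): 2.0}
rng = random.Random(0)               # seeded only to make the sketch repeatable
greedy_choice = select_action(table, 0, actions, goal=15, rng=rng)
goal_choice = select_action(table, 15, actions, goal=15, rng=rng)
```

The same function serves both algorithms, since `table` can hold either Q values or S values.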

Table Update: In order to propagate the strength 
values for each state-action combination, the entries in 
the table must be updated in one of two ways. Either a 
state is updated as it is visited by the agent (this is the 
implementation we chose), or the changes are propagated 
through the table for a chosen number of states at each 
time step. For example, [Mahadevan and Connell 
90] uses a five-step update process. The latter solution
speeds up the learning by a constant factor. Even if all 
of the states in the table are updated at each time step, 
the agent can still get stuck at the goal, illustrating that 
the update function is not related to the potential well 
problem. 

State Transition: The following state transition 
function was used: 



$$p(T(x \to y)) = \begin{cases} 0.9 & \text{if } T(x) = y \\ 0.1 & \text{otherwise} \end{cases}$$



A random state transition was selected 10% of the time. 
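
A noisy transition of this kind can be sketched as a wrapper around a deterministic transition function. The wrapper name, the uniform choice of the random successor, and the placeholder transition function are assumptions; the actual function of Figure 1 is not reproduced here.

```python
import random


def noisy_transition(T, s, a, states, rng, p_intended=0.9):
    """Follow T(s, a) with probability p_intended, else jump to a random state.

    Assumption: the random successor is drawn uniformly from `states`
    (and may therefore coincide with the intended successor).
    """
    if rng.random() < p_intended:
        return T(s, a)
    return rng.choice(states)


# Placeholder deterministic transition (NOT the function shown in Figure 1).
def placeholder_T(s, a):
    return (s + 1) % 16


states = list(range(16))
rng = random.Random(1)
always_intended = noisy_transition(placeholder_T, 3, "forward", states, rng,
                                   p_intended=1.0)
always_random = noisy_transition(placeholder_T, 3, "forward", states, rng,
                                 p_intended=0.0)
```

Passing the random number generator explicitly makes individual runs repeatable, which helps when comparing learning algorithms under the same noise.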

The algorithms were tested on the same problem, 
the same optimal policy, two different parameter values, 
and three different reinforcement functions. The data 
plots show individual runs of the learning algorithms as 
crosses. The y-axis in each plot indicates the number of 
time steps to convergence to the optimal policy, while the 
x-axis shows the different values of the parameters being 
tested. For each set of runs of an algorithm with particu- 
lar parameter settings, the mean number of time steps to 
convergence is indicated with a bullet, and the standard 
deviation is shown with a vertical line. The algorithms 
showed sensitivity to the randomness inherent in both 
of the learning rules. This sensitivity was manifested by 
large standard deviations in almost all experiments. 

Q-learning Performance: Figure 2 illustrates the 
performances of Q-learning using three different initial 
values for the Q table, while figure 3 shows its perfor- 
mance on two different initial states. The performance of 
the algorithm is not significantly affected by either of the 
parameters. Although the measured performance varies 
in both the mean and the standard deviation, the variation
is not significant compared to the average variance
between individual trials over a large number of runs.
Figure 4 illustrates the algorithm's apparent insensitivity
to small changes in the learning rate parameter $\beta$.
As shown in figure 5, Q-learning is very sensitive to the
value of $\gamma$, the future reward discount factor. This parameter
determines how much influence future reward
will have on the current state. In deterministic worlds,
such as the one used here, it is useful to set $\gamma$ close to 1 in
order to maximize the value of future information at each
time step. Consequently, the higher value of $\gamma$ speeds up
the learning.

Figure 2: The graph illustrates the performance of Q-learning
given three different initial values for the Q table.
The smaller of the values was in range of the optimal
policy, while the other exceeded it.

SBB Performance: Figure 6 illustrates the perfor- 
mance variation for two different initial S table values. 
The smaller of the values was in the range of the optimal 
policy, so a few of the initial S values were equal to their 
target values. Not surprisingly, this resulted in some- 
what faster mean convergence time and a significantly 
smaller standard deviation. However, in general it is not 
possible to have a good a priori estimate of the optimal
policy, thus making the problem of initializing the S
values (and Q values) difficult. Figure 7 shows the algorithm's
performance starting from three different initial
states, one of which was the goal state. The plot shows
no significant dependence on this parameter. Figure 8
illustrates the algorithm's sensitivity to the value of c,
the fraction of the received strength that is propagated
to the previous classifier. The larger value of c increases
the mean convergence time by over an order of magnitude.
The higher the value of c, the more weight is given
to each trial, causing the algorithm to rely on the "local"
decision and oscillate around the optimal policy before
being able to converge on it.

Figure 3: The graph shows the performance of Q-learning
started in two different initial states.

Comparison: Both algorithms were insensitive to 
initial Q and S values and start states, and sensitive 
to $\gamma$ and c, the parameters weighting the value of each
time step. $\gamma$ and c can be viewed as duals of each other.
A high value of $\gamma$ puts more importance on future trials.
Similarly, a low value of c decreases the weight of
the immediate S values, effectively increasing the impor- 
tance of future reward. The relative convergence times 
for Q-learning and Bucket Brigade were comparable for 
analogous scaling of those two parameters. 

In the shown trials, both algorithms were tested on a 
three-valued reinforcement function with small variance 
($r \in \{-1, 0, 5\}$), shown on the top of figure 9. When
tested on an impulse function ($r \in \{0, 3000\}$) shown on
the bottom of figure 9, the performance of both algo- 
rithms declined by several orders of magnitude. How- 
ever, our experiments demonstrated that the standard 
classifier system configuration of the Bucket Brigade al- 
gorithm favors the impulse reinforcement function. 

Finally, figure 10 compares the performance of Q-learning
and the SBB algorithm using the same state
space, the same initial state and initial values for the Q
and S tables, dual values of $\gamma$ and c ($\gamma = 0.9$, c = 0.1),
and the same reinforcement function (3-valued). On the
shown problem, the special case of the Bucket Brigade
outperforms Q-learning by approximately an order of
magnitude in the number of time steps required for convergence
to the optimal policy. However, further experimentation
showed that the use of even a slightly modified
reinforcement function (e.g. shifting the function shown
in figure 9 by one to $r \in \{0, 1, 6\}$) reversed the performance
results.

Figure 4: The graph plots the performance of Q-learning
with two different values of $\beta$, the learning rate.

Figure 5: The graph illustrates the performance of Q-learning
on two different values of $\gamma$, the expected future
reward discount factor.

The two algorithms need to be tested on a much larger 
number of trials and on different learning problems be- 
fore conclusions can be made about their performance 
differences. Further, a characterization of the parameter 
interaction is needed for proper analysis. But observa- 
tion of the data alone illustrates the algorithms' similar, 
and similarly uncharacterized, sensitivity to the learning 
parameters and to the unavoidable randomness inherent 
in the approach. 



6 Input Generalization 

Our simple example of the two RL algorithms demon- 
strates that, even in a 16-state world, the number of 
trials to convergence is prohibitively large. Indeed, the 
exponential relationship between the size of the input 
vector and the size of the state space is a key problem 
in reinforcement learning. It introduces both temporal 
and spatial constraints on the size of the learning prob- 
lems that can be addressed. Specifically, Q-learning is 
a table-based scheme, which necessitates keeping statis- 
tics about all of the states, which results in a tremen- 
dous memory requirement. Additionally, the larger the 
state space, the slower the system will be in converging 
to the desired policy. Q-learning requires visiting all of 
the states infinitely many times which, for most realistic 
problems, takes too long, even in simulation, and unre- 
alistically long if the experiments are performed in the 
physical world. Finally, the larger the ratio between the
number of states and the reinforcement, the slower the
learning will be. Sparse reinforcement aggravates the delayed
reinforcement problem and increases the number
of trials required for the system to discover the correct
policy. Pruning the state space by generalizing similar
input states is one of the key methods for improving performance
of table-driven RL approaches.

Bucket Brigade with different initial values in the S table.

Human programmers are excellent at generalization. 
The reason why it is much easier and faster, even for 
complex tasks, to hand code a behavior than to learn it, 
is that learning considers the entire state space of the 
problem whereas the human designer prunes it very ef- 
fectively. Usually, the problem of exponential state space 
is bypassed by a clever ordering of the rules, careful ar- 
bitration, and default conditions. None of these options 
are available in current RL approaches. 

[Chapman and Kaelbling 91] and [Mahadevan
and Connell 90] present complementary approaches to
input generalization. The Chapman-Kaelbling approach
starts with the most general solution (a single, most general
state) and splits it iteratively, based on statistics
accumulated over time. When a bit in a state vector
is found to be relevant, the state space is split into two
subspaces, one with that bit on, and the other with it
off. In contrast, the Mahadevan-Connell method starts
with a fully differentiated, specific set of all states, and
consolidates them based on similarity statistics accumulated
over time. Both processes produce state space trees
which are sufficiently differentiated but smaller than the
fully exponential space.

Figure 7: The graph shows the performance of the
Bucket Brigade algorithm on three different initial
states, one of them the goal state.

The default hierarchies of the CS paradigm are also a 
means of input generalization. Each instance of the # 
symbol allows for clustering two states into one, with the 
flexible grouping potential of full generality (all #'s) to 
full specificity (all non-#'s). Default hierarchies organize 
the specific rule instances in a system, and speed up the 
learning process, as previously described. If a CS starts 
with a single completely general classifier, it will gener- 
ate more specific rules over time. These rules will be 
grouped into a default hierarchy based on the relevance 
of individual bits which are changed from #'s to specific 
values. This process is neatly analogous to the
[Chapman and Kaelbling 91] method of bit splitting, which
also begins with the most general state and subdivides
and clusters the more specific states. The CS approach
is a more powerful extension, since it eliminates the need
for individual bit relevance. The process of generating
more specific classifiers implements a power set of the
state space. However, it does so incrementally, thus never
needing to keep around a large number of classifiers, as
would a power set implementation of any bit-relevance
algorithm akin to [Chapman and Kaelbling 91].

Figure 8: The performance of Bucket Brigade with two
different values of c, the bid scaling parameter. The
larger value of c = 0.5 slows the learning down by more
than an order of magnitude.
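
The role of the # symbol in a default hierarchy can be sketched
as follows. This is an illustration under simplifying
assumptions, not the classifier system used in the experiments:
conditions are strings over 0, 1 and #, and the most specific
matching condition wins, whereas an actual classifier system
would also weigh classifier strength when bidding. The function
names are hypothetical.

```python
from typing import List, Optional


def matches(condition: str, state: str) -> bool:
    """A condition matches when every non-# position equals the state bit."""
    return len(condition) == len(state) and all(
        c == "#" or c == s for c, s in zip(condition, state)
    )


def covered_states(condition: str) -> int:
    """Each # doubles the number of fully specified states a condition covers."""
    return 2 ** condition.count("#")


def specificity(condition: str) -> int:
    """Number of positions that are fixed to 0 or 1."""
    return sum(c != "#" for c in condition)


def most_specific_match(conditions: List[str], state: str) -> Optional[str]:
    """Pick the most specific matching condition (assumed tie-breaking rule)."""
    candidates = [c for c in conditions if matches(c, state)]
    return max(candidates, key=specificity, default=None)


# A three-level default hierarchy: a general default and two exceptions.
hierarchy = ["###", "1##", "10#"]
for state in ("011", "111", "101"):
    print(state, "->", most_specific_match(hierarchy, state))
```

The fully general condition acts as the default, and the more
specific conditions override it only for the states they match.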

The genetic component of the classifier system, which 
generates new rules and eliminates weak ones, implic- 
itly implements the statistics that are kept explicitly in 
[Chapman and Kaelbling 91]. While the latter must
apply some statistical analysis of the gathered data, CS 
simply works by trial and error until the proper popu- 
lation of appropriate specificity classifiers is evolved. It 
would be interesting to empirically compare the perfor- 
mance of the two methods on a common problem. 

Figure 9: The top graph shows the 3-valued reinforcement
function used for testing the two learning algorithms.
The bottom graph shows the impulse function which worked
well with the standard Bucket Brigade but significantly
slowed down both Q-learning and SBB.



The input generalization problem is also addressed by
the connectionist literature [Hinton 90]. A key difference
between connectionist approaches and the RL
paradigm is that while RL schemes can be entirely 
semantics-free and statistical, the generalization crite- 
ria are hand coded and therefore understood. In the 
connectionist case, the representations generated by the 
networks are not meaningful or usable for the designer. 
Furthermore, they cannot be debugged if the general- 
ization process fails; and the only solution is to tune 
the various parameters until the right generalization is 
found. The key difference, then, is that the general- 
isation in RL (both in Q-learning and CS) is built in, 
whereas in connectionist approaches it is a result of the 
network dynamics. 

Paradoxically, it is precisely the unwieldy, fully- 
exponential quality of the RL state spaces that gives 
them one of their main positive properties: asymptotic 
completeness. While hand coded reactive policies take 
advantage of the cleverness of the designer, they are al- 
most never provably complete. Most irrelevant input 
states are easily eliminated, but potentially useful ones 
can be overlooked. Complete state spaces, on the other 
hand, guarantee that the agent will, given sufficient time 
and sufficiently rich reinforcement, produce a provably 
complete policy. However, this quality is of little use if 
the world is dynamic or the state space is large. 

6.1 Modularisation 

The problem of learning the optimal policy can be cast
as searching for paths in the action space which connect
the current state with the goal. The longer the distance
between a state and the goal, the longer it takes
to learn the policy or the path. This is why policies for
large state spaces in Q-learning, and long classifier
sequences in the Bucket Brigade, both take a long time to
be learned. Breaking the problem into modules or
subproblems effectively shortens the distance between the
reinforcement signal and the individual actions.
Consequently, the length of action sequences to be learned
is decreased. However, breaking the problem up into an
appropriate set of modules requires domain information
about the particular learning task.

Figure 10: A plot comparing the performance of
Q-learning (on the left) and the SBB algorithm (on the
right) on the same learning problem.
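
The effect of distance on learning time can be made concrete
with a small sketch. It assumes a deterministic chain of states
with a single forward action and a reward of 1 only on reaching
the goal, and it applies a one-step backup of the kind used in
Q-learning, reduced to one action. The function name and the
parameter values are hypothetical, and the setup is far simpler
than the experiments reported in this paper.

```python
def episodes_until_start_learns(path_length: int,
                                alpha: float = 1.0,
                                gamma: float = 0.9) -> int:
    """Count episodes until the start state first gets a nonzero value."""
    if path_length < 1:
        raise ValueError("path_length must be at least 1")
    # q[s] is the value of moving forward from state s;
    # q[path_length] stands for the terminal goal and stays at 0.
    q = [0.0] * (path_length + 1)
    episodes = 0
    while q[0] == 0.0:
        episodes += 1
        # One episode walks from the start state to the goal.
        for s in range(path_length):
            reward = 1.0 if s + 1 == path_length else 0.0
            q[s] += alpha * (reward + gamma * q[s + 1] - q[s])
    return episodes


# Monolithic task: one reward at the end of a 12-step path.
print("monolithic:", episodes_until_start_learns(12))

# Modular task: three 4-step modules, each rewarded at its own subgoal.
# Assumption: the modules are trained independently of one another.
print("per module:", episodes_until_start_learns(4))
```

In this setting the reward moves back one state per episode, so
the number of episodes grows with the path length, and a shorter
path to each subgoal is learned correspondingly sooner.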

[Mahadevan and Connell 90] give an example of 
breaking up a box pushing task into three modules, effec- 
tively introducing three subgoals into the learning task. 
The three are carefully chosen to be orthogonal and non- 
conflicting, based on the particular task. The robot's 
behavior repertoire is designed so that whatever state 
it is in, it is pursuing one of the subgoals: finding a
box, pushing a box, or getting unstuck. The reinforcement
depends on which of the subgoals is being pursued,
but it is available more frequently, since the distance
between any state and one of the subgoals is decreased.
Not surprisingly, when tested in both simulation and on 
the real robot, the modular approach far outperforms the 
monolithic design in which the robot is only rewarded for 
actually maintaining contact with and pushing a box. 

It is unlikely that any universal strategy for dividing 
the task into modules exists. However, it would be use- 
ful to derive a few principles for task decomposition for 
particular classes of learning problems. Another inter- 
esting question is whether the modularisation of a task is 
dependent on the learning algorithm, i.e. whether there 
exists some "optimal" set of modules which is indepen- 
dent of the way the modules are learned, but is tied 
instead to the semantic definition of the problem. 

7 Built in Structure and Knowledge 

It is often said that "one cannot learn anything un- 
less one almost knows it already" [Winston 84]. The 
tradeoff between the type and amount of built in versus 
learned information is the key issue in machine learn- 
ing. The less structure is built in, the more is left to 
the algorithm to discover. Minimizing built in structure 
in order to ease the programming task and reduce the 
learning bias often results in over-specificity and nar- 
rowness. It makes the learning process slower, the space 
and time complexity larger, and the result more task- 
specific. Neural networks are an example of this type 
of data-driven learning, biased only by the structure of 
the network and the training set. These methods have 
been shown to be sensitive to initial conditions [Kolen 
and Pollack 90], very specific, and of limited ability to 
generalize [Hinton 90]. 

On the other end of the data-knowledge spectrum lie 
knowledge-based or knowledge-driven learning schemes.
They employ some form of a domain theory in order 
to minimize the amount of deduction left to the agent, 
as well as the amount of new information needed from 
the world. Explanation based learning (EBL) [Dejong
and Mooney 86] and explanation based generalization
(EBG) [Mitchell et al 86], [Mitchell et al 89] belong
in this category. These approaches are constrained
by the structure and amount of information provided by 
the domain theory, and rely on its completeness and ac- 
curacy. These properties have earned them the label of 
"strong" methods as compared with "weak" connection- 
ist approaches [Hinton 90]. 

Reinforcement learning is situated between the two ex- 
tremes of the spectrum, much closer to the data-driven 
end. Unlike both the connectionist approaches and EBL- 
style methods, which require an explicit teacher, RL is 
unsupervised. Consequently, it is well suited for adap- 
tive agents acting in changing, possibly nondeterministic 
worlds. Eliminating the teacher removes any bias that 
might be present in the training set. On the other hand, 
RL approaches rely on the environment to encode and 
manifest an observable and learnable mapping between 
the states the agent can perceive and the actions it can 
perform. The dependence on the environment rather
than the training set can be recast as the reliance on the
designer to properly structure the perceptual apparatus
and the reinforcement function.

Figure 11: This figure illustrates the relationship
between the amount and type of experimental trials in
different learning methodologies. The shaded area
indicates the desirable properties for a learning algorithm.

In order to avoid preprocessing the data, RL ap- 
proaches manipulate the "raw" input vector. To es- 
tablish a correlation between each state and the desired 
action, the algorithms search through the entire space 
of state-action combinations ($O(2^n |a|)$), requiring a large
number of trials to find the optimal policy. In contrast, 
knowledge-driven learning approaches rely on very few 
carefully constructed examples, since they encode most 
of the domain knowledge in the system a priori. 
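
The growth of the search space can be shown with a short
calculation. It is a sketch that assumes n binary inputs and a
fixed number of actions, with one table entry per state-action
pair; the function name and the choice of four actions are
hypothetical.

```python
def state_action_pairs(n_bits: int, n_actions: int) -> int:
    """Number of entries in a full table over n_bits binary inputs."""
    return (2 ** n_bits) * n_actions


# Each additional input bit doubles the number of pairs to explore.
for n_bits in (4, 8, 16, 24):
    print(n_bits, "bits:", state_action_pairs(n_bits, n_actions=4), "pairs")
```

Because every pair must be visited at least once before its
value can be estimated, the table size is a lower bound on the
number of trials in the unstructured case.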

Intuitively, the amount of built in structure and the 
number of training trials are inversely proportional. Ad- 
ditionally, the quality of the gained information af- 
fects the required number of additional trials. While 
knowledge-driven systems require only a few special ex- 
amples, connectionist systems use a large number of 
somewhat biased examples, and RL systems depend on 
many trials which can vary in both accuracy and rele- 
vance (figure 11). 

8 Summary 

This paper analyzed the reinforcement learning problem 
with respect to two specific algorithms: Q-learning and 
classifier systems using the Bucket Brigade. The
principal weaknesses of RL were discussed: the large time and
space complexity, input generalization, and the lack of 
built in structure. 

A main problem with RL approaches is their "unstruc- 
tured" utilization of the inputs. Since no domain in- 
formation is used, the entire space of state-action pairs 
must be explored. Consequently, these algorithms scale 
poorly with the number of input bits. However, a wealth 
of sensory information is a key to intelligence, so any fu- 
ture directions in learning must be helped, rather than 
hurt, by increased amounts of information. 

Learning can serve at least two different purposes in 
a situated agent. It can ease the programmer's job by
having the agent learn its own behaviors. It can also 
keep the agent adaptive to a changing world. So far, RL 
has not fulfilled either of those roles. Learning is a poor 
substitute for programming any real system because it 
is overwhelmingly complex and slow. Additionally, not 
enough is known about the internal dynamics of the pa- 
rameter interaction, which demands a lot of parameter 
tuning. It is not yet clear that tuning learning parame- 
ters is easier than tuning programming parameters in a 
hand coded nontrivial agent. 

The adaptation property of learning agents is indis- 
putably useful. However, current RL algorithms are 
very slow to converge to a policy and consequently 
slow to adapt. Perhaps more importantly, no RL work 
so far has demonstrated the ability to use previously 
learned knowledge to speed up the learning of an en- 
tirely new policy. Instead, the agents must either start 
from scratch, or worse, the current policy may be a detri- 
ment to learning the next one. Consequently, it has not 
yet been shown that agents using RL can adapt to more 
than a single policy. 

In spite of its weaknesses, reinforcement learning has
been demonstrated to perform well in certain types of
tasks and environments. Better understanding those
tasks, attempting a large number of versatile experiments
with nontrivial agents, and further characterizing
the real applications of the approach ought to be the
focus of further RL research.